Small data: practical modeling issues in human-model -omic data
This thesis is based on the following articles:
Chapter 2: Holsbø, E., Perduca, V., Bongo, L.A., Lund, E. & Birmelé, E. (Manuscript). Stratified time-course gene preselection shows a pre-diagnostic transcriptomic signal for metastasis in blood cells: a proof of concept from the NOWAC study. Available at https://doi.org/10.1101/141325.
Chapter 3: Bøvelstad, H.M., Holsbø, E., Bongo, L.A. & Lund, E. (Manuscript). A Standard Operating Procedure For Outlier Removal In Large-Sample Epidemiological Transcriptomics Datasets. Available at https://doi.org/10.1101/144519.
Chapter 4: Holsbø, E. & Perduca, V. (2018). Shrinkage estimation of rate statistics. Case Studies in Business, Industry and Government Statistics 7(1), 14-25. Also available at http://hdl.handle.net/10037/14678.
Human-model data are very valuable and important in biomedical research. Ethical and economic constraints limit access to such data, and consequently these datasets rarely comprise more than a few hundred observations. Because measurements are comparatively cheap, the tendency is to measure as many things as possible for the few, valuable participants in a study. With -omics technologies it is cheap and simple to make hundreds of thousands of measurements simultaneously. In technical language, this few-observations, many-measurements setting is a high-dimensional problem. Most gene expression experiments measure the expression levels of 10 000–15 000 genes for fewer than 100 subjects. I refer to this as the small data setting.
This dissertation is an exercise in practical data analysis as it happens in a large epidemiological cohort study. It comprises three main projects: (i) predictive modeling of breast cancer metastasis from whole-blood transcriptomics measurements; (ii) standardizing a microarray data quality assessment in the Norwegian Women and Cancer (NOWAC) postgenome cohort; and (iii) shrinkage estimation of rates. These three are all small data analyses for various reasons.
Predictive modeling in the small data setting is very challenging. There are several modern methods built to tackle high-dimensional data, but there is a need to evaluate these methods against one another when analyzing data in practice. Through the metastasis prediction work we learned first-hand that common practices in machine learning can be inefficient or harmful, especially for small data. I will outline some of the more important issues.
In a large project such as NOWAC there is a need to centralize and disseminate knowledge and procedures. The standardization of NOWAC quality assessment was a project born of necessity. The standard operating procedure for outlier removal was developed so that preprocessing of the NOWAC microarray material happens in the same way every time. We take this procedure from an archaic R script that resided in people's email inboxes to a well-documented, open-source R package, and present the NOWAC guidelines for microarray quality control. The procedure is built around the inherent high value of a single observation.
Small data are plagued by high variance. Working with small data it is usually profitable to bias models by shrinkage or borrowing of information from elsewhere. We present a pseudo-Bayesian estimator of rates in an informal crime rate study. We exhibit the value of such procedures in a small data setting and demonstrate some novel considerations about the coverage properties of such a procedure.
In short I gather some common practices in predictive modeling as applied to small data and assess their practical implications. I argue that, with more focus on human-based datasets in biomedicine, there is a need to give these data particular consideration in a small data paradigm to allow for reliable analysis. I will present what I believe to be sensible guidelines.
What is the state of the art? Accounting for multiplicity in machine learning benchmark performance
Machine learning methods are commonly evaluated and compared by their
performance on data sets from public repositories. This allows for multiple
methods, oftentimes several thousand, to be evaluated under identical
conditions and across time. The highest ranked performance on a problem is
referred to as state-of-the-art (SOTA) performance, and is used, among other
things, as a reference point for publication of new methods. Using the
highest-ranked performance as an estimate of SOTA gives a biased,
overly optimistic result. The mechanisms at play are those of
multiplicity, a topic that is well studied in the context of multiple
comparisons and multiple testing but that has, as far as the authors are aware,
been nearly absent from the discussion regarding SOTA estimates. The optimistic
state-of-the-art estimate is used as a standard for evaluating new methods, and
methods with substantially inferior results are easily overlooked. In this
article, we provide a probability distribution for the case of multiple
classifiers so that known analysis methods can be engaged and a better SOTA
estimate can be provided. We demonstrate the impact of multiplicity through a
simulated example with independent classifiers. We show how classifier
dependency impacts the variance, but also that the impact is limited when the
accuracy is high. Finally, we discuss a real-world example: a Kaggle
competition from 2020.
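The multiplicity effect described above is easy to see in a small simulation. The sketch below is my own illustration, not the authors' code: the function name and parameters are hypothetical, and the classifiers are assumed independent with identical true accuracy.

```python
import random

def max_observed_accuracy(true_acc, n_classifiers, n_test, seed=0):
    """Best observed accuracy among n_classifiers independent classifiers
    that all share the same true accuracy, each scored on n_test examples."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(n_classifiers):
        correct = sum(rng.random() < true_acc for _ in range(n_test))
        best = max(best, correct / n_test)
    return best

# The top-ranked ("SOTA") accuracy among 1000 equally good classifiers
# overshoots the true accuracy of 0.80 purely through multiplicity.
print(max_observed_accuracy(0.80, n_classifiers=1000, n_test=500))
```

Even though no classifier is better than any other, the leaderboard maximum lands well above 0.80, which is exactly the bias the article analyzes.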
Convolutional neural network for breathing phase detection in lung sounds
We applied deep learning to create an algorithm for breathing phase detection
in lung sound recordings, and we compared the breathing phases detected by the
algorithm and manually annotated by two experienced lung sound researchers. Our
algorithm uses a convolutional neural network with spectrograms as the
features, removing the need to specify features explicitly. We trained and
evaluated the algorithm using three subsets that are larger than previously
seen in the literature. We evaluated the performance of the method in two
ways. First, a discrete count of agreed breathing phases (using 50% overlap
between a pair of boxes) shows a mean agreement with lung sound experts of 97%
for inspiration and 87% for expiration. Second, the fraction of time of
agreement (in seconds) gives higher pseudo-kappa values for inspiration
(0.73-0.88) than expiration (0.63-0.84), showing an average sensitivity of 97%
and an average specificity of 84%. With both evaluation methods, the agreement
between the annotators and the algorithm shows human level performance for the
algorithm. The developed algorithm is valid for detecting breathing phases in
lung sound recordings.
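The 50% overlap criterion for counting agreed phases can be sketched as follows. This is my own reading of the criterion, not the paper's code; in particular, measuring overlap against the shorter of the two intervals is an assumption.

```python
def overlap_fraction(a, b):
    """Fraction of the shorter interval covered by the intersection of
    intervals a and b, each given as (start, end) in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    shorter = min(a[1] - a[0], b[1] - b[0])
    return inter / shorter if shorter > 0 else 0.0

def boxes_agree(a, b, threshold=0.5):
    """An annotator box and an algorithm box count as the same phase
    when they overlap by at least the threshold (50% by default)."""
    return overlap_fraction(a, b) >= threshold

print(boxes_agree((0.0, 1.0), (0.4, 1.2)))  # overlap is 0.75 -> True
```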
More efficient manual review of automatically transcribed tabular data
Machine learning methods have proven useful in transcribing historical data.
However, results from even highly accurate methods require manual verification
and correction. Such manual review can be time-consuming and expensive;
therefore, the objective of this paper was to make it more efficient.
Previously, we used machine learning to transcribe 2.3 million handwritten
occupation codes from the Norwegian 1950 census with high accuracy (97%). We
manually reviewed the 90,000 (3%) codes with the lowest model confidence. We
allocated those 90,000 codes to human reviewers, who used our annotation tool
to review the codes. To assess reviewer agreement, some codes were assigned to
multiple reviewers. We then analyzed the review results to understand the
relationship between accuracy improvements and effort. Additionally, we
interviewed the reviewers to improve the workflow. The reviewers corrected
62.8% of the labels and agreed with the model label in 31.9% of cases. About
0.2% of the images could not be assigned a label, while for 5.1% the reviewers
were uncertain, or they assigned an invalid label. 9,000 images were
independently reviewed by multiple reviewers, resulting in an agreement of
86.43% and disagreement of 8.96%. We learned that our automatic transcription
is biased towards the most frequent codes, with a higher degree of
misclassification for the lowest frequency codes. Our interview findings show
that the reviewers did internal quality control and found our custom tool
well-suited. So, only one reviewer is needed, but they should report
uncertainty.
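The confidence-based triage described above, where the 3% of codes with the lowest model confidence go to human reviewers, might look roughly like this sketch. The function name, data layout, and ranking rule are my assumptions, not the authors' pipeline.

```python
def select_for_review(predictions, budget_fraction=0.03):
    """Send the budget_fraction of predictions with the lowest model
    confidence to manual review; accept the rest automatically.
    predictions: list of (image_id, predicted_code, confidence)."""
    ranked = sorted(predictions, key=lambda p: p[2])
    n_review = round(len(predictions) * budget_fraction)
    return ranked[:n_review], ranked[n_review:]

# Toy example with a 50% review budget for illustration.
preds = [("img1", "101", 0.99), ("img2", "203", 0.42),
         ("img3", "101", 0.91), ("img4", "307", 0.15)]
to_review, accepted = select_for_review(preds, budget_fraction=0.5)
print([p[0] for p in to_review])  # the two least confident: img4, img2
```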
Occode: an end-to-end machine learning pipeline for transcription of historical population censuses
Machine learning approaches achieve high accuracy for text recognition and
are therefore increasingly used for the transcription of handwritten historical
sources. However, using machine learning in production requires a streamlined
end-to-end machine learning pipeline that scales to the dataset size, and a
model that achieves high accuracy with few manual transcriptions. In addition,
the correctness of the model results must be verified. This paper describes our
lessons learned developing, tuning, and using the Occode end-to-end machine
learning pipeline for transcribing 7.3 million rows with handwritten occupation
codes in the Norwegian 1950 population census. We achieve an accuracy of 97%
for the automatically transcribed codes, and we send 3% of the codes for manual
verification. We verify that the occupation code distribution found in our
result matches the distribution found in our training data, which should be
representative of the census as a whole. We believe our approach and lessons
learned are useful for other transcription projects that plan to use machine
learning in production. The source code is available at:
https://github.com/uit-hdl/rhd-code
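One simple way to sketch the distribution check mentioned above is a total variation distance between the empirical code distributions of the training data and the transcribed result. The paper does not specify its comparison method, so this is an illustration under that assumption.

```python
from collections import Counter

def total_variation(codes_a, codes_b):
    """Total variation distance between the empirical distributions of
    two code samples: 0 means identical, 1 means disjoint support."""
    pa, pb = Counter(codes_a), Counter(codes_b)
    na, nb = len(codes_a), len(codes_b)
    support = set(pa) | set(pb)
    return 0.5 * sum(abs(pa[c] / na - pb[c] / nb) for c in support)

# Hypothetical code frequencies: a small distance suggests the transcribed
# result preserves the code distribution seen in the training data.
train = ["101"] * 70 + ["203"] * 20 + ["307"] * 10
result = ["101"] * 68 + ["203"] * 22 + ["307"] * 10
print(total_variation(train, result))  # 0.02
```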
Large Multiples : exploring the large-scale scattergun approach to visualization and analysis
We create 2.5 quintillion bytes of data every day. A whole 90% of the world's data was created in the last two years. One contribution to this massive bulk of data is Twitter: Twitter users create 500 million tweets a day, a fact that has greatly impacted social science [24] and journalism [39].
Network analysis is important in social science [6], but with so much data there is a real danger of information overload, and there is a general need for tools that help users navigate and make sense of it.
Data exploration is one way of analyzing a data set. Exploration-based analysis lets the data suggest hypotheses, as opposed to starting out with a hypothesis to confirm or refute. Visualization is an important exploration tool.
Given the ready availability of large-scale displays [1], we believe that an ideal visual exploration system would leverage these, and leverage the fact that there are many different ways to visualize something. We propose to use wall-sized displays to provide many different views of the same data set and as such let the user explore the data by exploring visualizations. Our thesis is that a display wall architecture [1, 42] is an ideal platform for such a scheme, providing both the resolution and the compute power required. Proper utilization of this would allow for useful sensemaking and storytelling.
To evaluate our thesis we have built a system for gathering and analyzing Twitter data, and exploring it through multiple visualizations.
Our evaluation of the prototype has provided us with insights that will allow us to create a practicable system, and demonstrations of the prototype have uncovered interesting stories in our case study data set. We find that clever pre-computation, pipelining, or streaming is strictly necessary to meet the latency requirements of serving visualizations at interactive speed.
Our further experiments with the system have led to new discoveries in streaming graph processing.
Metastatic Breast Cancer and Pre-Diagnostic Blood Gene Expression Profiles—The Norwegian Women and Cancer (NOWAC) Post-Genome Cohort
Breast cancer patients with metastatic disease have a higher incidence of deaths from breast cancer than patients with early-stage cancers. Recent findings suggest that there are differences in immune cell function between metastatic and non-metastatic cases, even years before diagnosis. We have analyzed whole blood gene expression by Illumina bead chips in blood samples taken using the PAXgene blood collection system up to two years before diagnosis. The final study sample included 197 breast cancer cases and 197 age-matched controls. We defined a causal directed acyclic graph to guide a Bayesian data analysis to estimate the risk of metastasis associated with the expression of all genes and with relevant sets of genes. We ranked genes and gene sets according to the sign probability for excess risk. Among the screening-detected cancers, 82% were without metastasis, compared to 53% of cancers detected between screening rounds. Among the highest-ranking genes and gene sets associated with metastasis risk, we identified plasmacytoid dendritic cell function, the SLC22 family of transporters, and glutamine metabolism as potential links between the immune system and metastasis. We conclude that there may be potentially wide-reaching differences in blood gene expression profiles between metastatic and non-metastatic breast cancer cases up to two years before diagnosis, which warrants future study.
Shrinkage estimation of rate statistics
This paper presents a simple shrinkage estimator of rates based on Bayesian methods. Our focus is on crime rates as a motivating example. The estimator shrinks each town's observed crime rate toward the country-wide average crime rate according to town size. Through realistic simulations we confirm that the proposed estimator outperforms the maximum likelihood estimator in terms of global risk. We also show that it has better coverage properties.
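A minimal sketch of such a shrinkage estimator, assuming a Gamma-Poisson formulation with the prior centered on the country-wide rate: the paper's exact estimator may differ, and `prior_strength` is a hypothetical tuning parameter playing the role of a pseudo-population.

```python
def shrunk_rate(events, population, country_rate, prior_strength=10000.0):
    """Gamma-Poisson posterior-mean rate estimate. The observed rate
    events/population is pulled toward country_rate; small towns are
    shrunk strongly, large towns hardly at all."""
    alpha = country_rate * prior_strength  # prior pseudo-events
    beta = prior_strength                  # prior pseudo-population
    return (events + alpha) / (population + beta)

# A town of 200 with one crime has a noisy raw rate of 0.005; the shrunk
# estimate sits much closer to an assumed country rate of 0.002.
print(shrunk_rate(1, 200, country_rate=0.002))
```

Because the prior contributes a fixed pseudo-population, the amount of shrinkage automatically depends on town size, which is the behavior the abstract describes.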